Section: Scientific Foundations

Automatic Differentiation

Participants: Laurent Hascoët, Valérie Pascual.


Glossary
automatic differentiation

(AD) Automatic transformation of a program into a new program that computes some derivatives of the given initial program, i.e. some combination of the partial derivatives of the program's outputs with respect to its inputs.

adjoint

Mathematical manipulation of the Partial Differential Equations that define a problem, obtaining new differential equations that define the gradient of the original problem's solution.

checkpointing

General trade-off technique, used in adjoint-mode AD, that trades duplicate execution of a part of the program for the memory space that would otherwise hold its intermediate results. Checkpointing a code fragment amounts to running this fragment without storing any intermediate values, thus saving memory space. Later, when such an intermediate value is required, the fragment is run a second time to obtain it.


Automatic or Algorithmic Differentiation (AD) differentiates programs. An AD tool takes as input a source program $P$ that, given a vector argument $X \in \mathbb{R}^n$, computes some vector result $Y = F(X) \in \mathbb{R}^m$. The AD tool generates a new source program $P'$ that, given the argument $X$, computes some derivatives of $F$. The resulting $P'$ reuses the control of $P$.
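To fix ideas, here is a hedged sketch (hand-written, not the output of any particular AD tool; all names are illustrative) of a small program $P$ and a derivative program $P'$ that computes one directional derivative while reusing $P$'s control, namely the same loop and the same branch:

    import math

    def P(x):
        # Y = F(X): the original program, with a loop and a branch.
        y = x
        for _ in range(3):
            y = math.sin(y)
        if y > 0.0:
            y = y * y
        return y

    def P_prime(x, xd):
        # P': returns (Y, Yd) with Yd = F'(X).Xd, following the same control as P.
        y, yd = x, xd
        for _ in range(3):
            yd = math.cos(y) * yd    # derivative statement reads y before it is overwritten
            y = math.sin(y)
        if y > 0.0:
            yd = 2.0 * y * yd
            y = y * y
        return y, yd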

For any given control, P is equivalent to a sequence of instructions, which is identified with a composition of vector functions. Thus, if

$$P \;\text{is}\; \{I_1; I_2; \ldots; I_p\}, \qquad F = f_p \circ f_{p-1} \circ \cdots \circ f_1, \tag{1}$$

where each $f_k$ is the elementary function implemented by instruction $I_k$. AD applies the chain rule to obtain derivatives of $F$. Calling $X_k$ the values of all variables after instruction $I_k$, i.e. $X_0 = X$ and $X_k = f_k(X_{k-1})$, the chain rule gives the Jacobian of $F$

$$F'(X) = f'_p(X_{p-1}) \cdot f'_{p-1}(X_{p-2}) \cdot \cdots \cdot f'_1(X_0) \tag{2}$$

which can be mechanically written as a sequence of instructions $I'_k$. Combining the $I'_k$ with the control of $P$ yields $P'$. This can be generalized to higher-order derivatives, Taylor series, etc.
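For instance, take $P$ to be the two-instruction program $\{x_1 = \sin(x_0);\; x_2 = x_1 \times x_1\}$, so that $F(x) = \sin^2(x)$ with $f_1 = \sin$ and $f_2$ the squaring function (an example chosen purely for illustration). Equation (2) then gives

$$F'(x_0) = f'_2(x_1) \cdot f'_1(x_0) = 2 x_1 \cdot \cos(x_0) = 2 \sin(x_0) \cos(x_0),$$

and the corresponding derivative instructions $I'_k$ read $x_{1d} = \cos(x_0) \times x_{0d}$ and $x_{2d} = 2 x_1 \times x_{1d}$.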

In practice, the Jacobian $F'(X)$ is often too expensive to compute and store, but most applications only need projections of $F'(X)$ such as:

  • Sensitivities, defined for a given direction $\dot{X}$ in the input space as:

    $$F'(X) \cdot \dot{X} = f'_p(X_{p-1}) \cdot f'_{p-1}(X_{p-2}) \cdot \cdots \cdot f'_1(X_0) \cdot \dot{X}. \tag{3}$$

    Sensitivities are easily computed from right to left, interleaved with the original program instructions. This is the tangent mode of AD, illustrated in the sketch after this list.

  • Adjoints, defined for a given weighting $\bar{Y}$ of the outputs as:

    $$F'^{*}(X) \cdot \bar{Y} = f'^{*}_1(X_0) \cdot f'^{*}_2(X_1) \cdot \cdots \cdot f'^{*}_{p-1}(X_{p-2}) \cdot f'^{*}_p(X_{p-1}) \cdot \bar{Y}. \tag{4}$$

    Adjoints are most efficiently computed from right to left, because matrix×vector products are cheaper than matrix×matrix products. This is the adjoint mode of AD, also illustrated in the sketch after this list; it is most effective for optimization, data assimilation [33], adjoint problems [28], or inverse problems.
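The two modes can be contrasted on the two-instruction example above. The following sketch is purely illustrative (hand-written rather than produced by an AD tool; in this scalar case the transposed Jacobians $f'^{*}_k$ coincide with the $f'_k$, and all names are hypothetical):

    import math

    def tangent(x0, x0d):
        # Tangent mode: each derivative statement is interleaved with
        # the original instruction, propagating the direction forward.
        x1d = math.cos(x0) * x0d     # f1'(x0) applied to the incoming direction
        x1 = math.sin(x0)
        x2d = 2.0 * x1 * x1d         # f2'(x1) applied to x1d
        x2 = x1 * x1
        return x2, x2d               # Y = F(X) and the sensitivity F'(X).Xd

    def adjoint(x0, x2b):
        # Adjoint mode: a forward sweep runs the original program and
        # keeps the intermediate values, then a reverse sweep applies
        # the transposed elementary Jacobians in reverse order.
        x1 = math.sin(x0)            # forward sweep
        x2 = x1 * x1
        x1b = 2.0 * x1 * x2b         # reverse sweep: f2'*(x1) applied to Yb
        x0b = math.cos(x0) * x1b     # f1'*(x0) applied to x1b
        return x2, x0b               # Y = F(X) and the adjoint F'*(X).Yb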

Adjoint-mode AD thus yields a very efficient program, at least theoretically [30]: the computation time required for the gradient is only a small multiple of the run time of $P$, and it is independent of the number of inputs $n$. In contrast, computing the same gradient with the tangent mode would require running the tangent differentiated program $n$ times, once per input direction.
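This cost argument can be made concrete on a small vector function. The sketch below is again illustrative, with hand-coded derivatives of the arbitrarily chosen $F(X) = \sum_i \sin^2(x_i)$ and hypothetical names:

    import math

    def F_tangent(x, xd):
        # One tangent run yields one directional derivative F'(X).Xd.
        return sum(2.0 * math.sin(xi) * math.cos(xi) * xdi
                   for xi, xdi in zip(x, xd))

    def F_adjoint(x, yb=1.0):
        # One adjoint run yields the full gradient F'*(X).Yb.
        return [2.0 * math.sin(xi) * math.cos(xi) * yb for xi in x]

    x = [0.1, 0.2, 0.3]
    n = len(x)
    # Tangent mode needs n runs, one per basis direction of the input space:
    grad_tangent = [F_tangent(x, [1.0 if j == i else 0.0 for j in range(n)])
                    for i in range(n)]
    # Adjoint mode recovers the same gradient in a single run:
    grad_adjoint = F_adjoint(x)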

However, the $X_k$ are required in the inverse of their computation order. If the original program overwrites a part of $X_k$, the differentiated program must restore $X_k$ before it is used by $f'^{*}_{k+1}(X_k)$. Therefore, the central research problem of adjoint-mode AD is to make the $X_k$ available in reverse order at the cheapest cost, using strategies that combine storage, repeated forward computation from available previous values, or even inverted computation from available later values.
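The sketch below contrasts the two extreme strategies on a toy program made of repeated $x := \sin(x)$ steps (illustrative names throughout). Checkpointing, as defined in the glossary above, interpolates between them by storing only selected intermediate states and recomputing within each segment:

    import math

    def adjoint_store_all(x, xb, steps):
        # Store-all: push every intermediate value on a tape during the
        # forward sweep, pop it back during the reverse sweep.
        # Memory grows linearly with the number of steps.
        tape = []
        for _ in range(steps):
            tape.append(x)
            x = math.sin(x)
        for _ in range(steps):
            xk = tape.pop()              # X_{k-1}, recovered in reverse order
            xb = math.cos(xk) * xb
        return xb

    def adjoint_recompute_all(x0, xb, steps):
        # Recompute-all: store nothing; recover each X_{k-1} by rerunning
        # the program from X_0. Memory is constant, run time quadratic.
        for k in range(steps, 0, -1):
            x = x0
            for _ in range(k - 1):
                x = math.sin(x)          # recompute X_{k-1} from scratch
            xb = math.cos(x) * xb
        return xb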

Another research issue is to make the AD model cope with the constant evolution of modern language constructs. Since the days of Fortran 77, novelties include pointers and dynamic allocation, modularity, structured data types, objects, vectorial notation, and parallel communication. We regularly extend our models and tools to handle these new constructs.